Concept-based Text Clustering
نویسنده
چکیده
Thematic organization of text is a natural practice of humans and a crucial task for today’s vast repositories. Clustering automates this by assessing the similarity between texts and organizing them accordingly, grouping like ones together and separating those with different topics. Clusters provide a comprehensive logical structure that facilitates exploration, search and interpretation of current texts, as well as organization of future ones. Automatic clustering is usually based on words. Text is represented by the words it mentions, and thematic similarity is based on the proportion of words that texts have in common. The resulting bag-of-words model is semantically ambiguous and undesirably orthogonal—it ignores the connections between words. This thesis claims that using concepts as the basis of clustering can significantly improve effectiveness. Concepts are defined as units of knowledge. When organized according to the relations among them, they form a concept system. Two concept systems are used here: WordNet, which focuses on word knowledge, and Wikipedia, which encompasses world knowledge. We investigate a clustering procedure with three components: using concepts to represent text; taking the semantic relations among them into account during clustering; and learning a text similarity measure from concepts and their relations. First, we demonstrate that concepts provide a succinct and informative representation of the themes in text, exemplifying this with the two concept systems. Second, we define methods for utilizing concept relations to enhance clustering by making the representation models more discriminative and extending thematic similarity beyond surface overlap. Third, we present a similarity measure based on concepts and their relations that is learned from a small number of examples, and show that it both predicts similarity consistently with human judgement and improves clustering. The thesis provides strong support for the use of concept-based representations instead of the classic bag-of-words model.
منابع مشابه
خوشهبندی اسناد مبتنی بر آنتولوژی و رویکرد فازی
Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...
متن کاملNatural scene text localization using edge color signature
Localizing text regions in images taken from natural scenes is one of the challenging problems dueto variations in font, size, color and orientation of text. In this paper, we introduce a new concept socalled Edge Color Signature for localizing text regions in an image. This method is able to localizeboth Farsi and English texts. In the proposed method rst a pyramid using diff...
متن کاملLearning Taxonomy for Text Segmentation by Formal Concept Analysis
In this paper the problems of deriving a taxonomy from a text and concept-oriented text segmentation are approached. Formal Concept Analysis (FCA) method is applied to solve both of these linguistic problems. The proposed segmentation method offers a conceptual view for text segmentation, using a context-driven clustering of sentences. The Concept-oriented Clustering Segmentation algorithm (COC...
متن کاملA Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کاملOntology-based Concept Weighting for Text Documents
Documents clustering become an essential technology with the popularity of the Internet. That also means that fast and high-quality document clustering technique play core topics. Text clustering or shortly clustering is about discovering semantically related groups in an unstructured collection of documents. Clustering has been very popular for a long time because it provides unique ways of di...
متن کامل